Following the tutorial at:


In [24]:
import pandas as pd

In [25]:
# pandas has two primary data structures: Series (a single column) and DataFrame (a table of named columns)
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])

In [26]:
pd.DataFrame({"City Name": city_names, "Population": population})


Out[26]:
       City Name  Population
0  San Francisco      852469
1       San Jose     1015785
2     Sacramento      485199

In [27]:
# Import an existing CSV file into a DataFrame
california_housing_dataframe = pd.read_csv(
    "https://storage.googleapis.com/mledu-datasets/california_housing_train.csv",
    sep=","
)

In [28]:
california_housing_dataframe.shape


Out[28]:
(17000, 9)

In [29]:
california_housing_dataframe.head()


Out[29]:
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
0    -114.31     34.19                15.0       5612.0          1283.0      1015.0       472.0         1.4936             66900.0
1    -114.47     34.40                19.0       7650.0          1901.0      1129.0       463.0         1.8200             80100.0
2    -114.56     33.69                17.0        720.0           174.0       333.0       117.0         1.6509             85700.0
3    -114.57     33.64                14.0       1501.0           337.0       515.0       226.0         3.1917             73400.0
4    -114.57     33.57                20.0       1454.0           326.0       624.0       262.0         1.9250             65500.0

In [30]:
california_housing_dataframe.hist('housing_median_age')


Out[30]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f1ce0fe0e80>]],
      dtype=object)
[Figure: histogram of housing_median_age]

Accessing Data

You can access DataFrame data using familiar Python dict/list operations:


In [31]:
cities = pd.DataFrame({'City Name': city_names, 'Population': population})
print(type(cities['City Name']))
cities['City Name']


<class 'pandas.core.series.Series'>
Out[31]:
0    San Francisco
1         San Jose
2       Sacramento
Name: City Name, dtype: object

In [32]:
print(type(cities["City Name"][1]))
cities["City Name"][1]


<class 'str'>
Out[32]:
'San Jose'

In [33]:
print(type(cities[0:2]))
cities[0:2]


<class 'pandas.core.frame.DataFrame'>
Out[33]:
       City Name  Population
0  San Francisco      852469
1       San Jose     1015785
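Bracket indexing does two different jobs in the cells above: a string selects a column, while a slice selects rows. As a supplement that is not part of the tutorial, the sketch below uses the explicit .loc (label-based) and .iloc (position-based) accessors to make the intent visible. It rebuilds a two-column version of the cities table so that it runs on its own.

```python
import pandas as pd

cities = pd.DataFrame({
    'City Name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# Label-based access: the row whose index label is 1, column 'City Name'.
by_label = cities.loc[1, 'City Name']

# Position-based access: second row, first column.
by_position = cities.iloc[1, 0]

# Position-based row slice; selects the same rows as cities[0:2].
first_two = cities.iloc[0:2]

print(by_label, by_position, len(first_two))
```

With the default RangeIndex, labels and positions coincide, so both accessors return the same cell. They diverge once the rows have been reordered or filtered.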

Manipulating Data

You may apply Python's basic arithmetic operations to Series. For example:


In [36]:
population / 1000


Out[36]:
0     852.469
1    1015.785
2     485.199
dtype: float64

In [37]:
import numpy as np
np.log(population)


Out[37]:
0    13.655892
1    13.831172
2    13.092314
dtype: float64

In [40]:
cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])
cities['Population density'] = cities['Population'] / cities['Area square miles']
cities


Out[40]:
       City Name  Population  Area square miles  Population density
0  San Francisco      852469              46.87        18187.945381
1       San Jose     1015785             176.53         5754.177760
2     Sacramento      485199              97.92         4955.055147

In [39]:
population.apply(lambda val: val > 1000000)


Out[39]:
0    False
1     True
2    False
dtype: bool
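For a simple comparison like this one, the lambda is not strictly needed, because comparison operators are applied element-wise to a Series in the same way as the arithmetic operators. A minimal sketch, reusing the population values from above:

```python
import pandas as pd

population = pd.Series([852469, 1015785, 485199])

# Element-wise comparison via apply, as in the cell above.
via_apply = population.apply(lambda val: val > 1000000)

# The same comparison written directly on the Series.
vectorized = population > 1000000

print(vectorized.tolist() == via_apply.tolist())
```

Series.apply remains useful when the per-element logic cannot be expressed with built-in operators.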

Exercise #1

Modify the cities table by adding a new boolean column that is True if and only if both of the following are True:

  • The city is named after a saint.
  • The city has an area greater than 50 square miles.

Note: Boolean Series are combined using the bitwise, rather than the traditional boolean, operators. For example, when performing logical and, use & instead of and.

Hint: "San" in Spanish means "saint."
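The note about bitwise operators can be illustrated separately from the exercise itself. The sketch below combines two boolean Series with &, | and ~, and shows that Python's `and` raises an error instead. The threshold of 500000 and the variable names are chosen only for illustration.

```python
import pandas as pd

area = pd.Series([46.87, 176.53, 97.92])
population = pd.Series([852469, 1015785, 485199])

large = area > 50
populous = population > 500000

both = large & populous      # element-wise AND
either = large | populous    # element-wise OR
not_large = ~large           # element-wise NOT

# Python's `and` needs a single truth value, which a multi-element
# Series cannot provide, so pandas raises ValueError.
try:
    large and populous
    raised = False
except ValueError:
    raised = True

print(both.tolist(), either.tolist(), not_large.tolist(), raised)
```

Each comparison is wrapped in its own variable here. When the comparisons are written inline, they need parentheses, because & binds more tightly than > in Python.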


In [46]:
# True only when the area exceeds 50 square miles AND the name starts with "San"
cities['is saint and wide'] = (
    (cities['Area square miles'] > 50)
    & cities['City Name'].apply(lambda name: name.startswith("San"))
)
cities


Out[46]:
       City Name  Population  Area square miles  Population density  is saint and wide
0  San Francisco      852469              46.87        18187.945381              False
1       San Jose     1015785             176.53         5754.177760               True
2     Sacramento      485199              97.92         4955.055147              False

Indexes

Both Series and DataFrame objects also define an index property that assigns an identifier value to each Series item or DataFrame row.

By default, at construction, pandas assigns index values that reflect the ordering of the source data. Once created, the index values are stable; that is, they do not change when data is reordered.


In [47]:
city_names.index


Out[47]:
RangeIndex(start=0, stop=3, step=1)

In [48]:
cities.index


Out[48]:
RangeIndex(start=0, stop=3, step=1)
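The stability of index values can be checked directly. The sketch below is an illustration rather than part of the tutorial: it sorts the population Series and shows that each value keeps the label it was given at construction.

```python
import pandas as pd

population = pd.Series([852469, 1015785, 485199])

# Sorting reorders the rows, but every value keeps its original index label.
sorted_population = population.sort_values()

print(sorted_population.index.tolist())

# Label 1 still refers to San Jose's population after the sort.
print(sorted_population.loc[1])
```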

In [50]:
cities.reindex([2, 0, 1])


Out[50]:
       City Name  Population  Area square miles  Population density  is saint and wide
2     Sacramento      485199              97.92         4955.055147              False
0  San Francisco      852469              46.87        18187.945381              False
1       San Jose     1015785             176.53         5754.177760               True

Reindexing is a convenient way to shuffle (randomize) a DataFrame. In the example below, we take the index, which is array-like, and pass it to NumPy's random.permutation function, which returns a shuffled copy of its values and leaves the original index untouched. Calling reindex with this shuffled array returns a new DataFrame whose rows are in the same shuffled order.


In [52]:
cities.reindex(np.random.permutation(cities.index))


Out[52]:
       City Name  Population  Area square miles  Population density  is saint and wide
1       San Jose     1015785             176.53         5754.177760               True
2     Sacramento      485199              97.92         4955.055147              False
0  San Francisco      852469              46.87        18187.945381              False
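The output above changes on every run. As a hedged variation that is not part of the tutorial, the sketch below uses a seeded NumPy generator so the shuffle is repeatable, and confirms that neither the permutation nor reindex modifies the original table. The seed values are arbitrary, and DataFrame.sample is shown as an alternative route to the same result.

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({
    'City Name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# A seeded generator makes the shuffle repeatable (42 is an arbitrary seed).
rng = np.random.default_rng(42)

# permutation returns a new shuffled array; cities.index is not modified.
shuffled_index = rng.permutation(cities.index.to_numpy())

# reindex returns a new DataFrame; cities keeps its original row order.
shuffled = cities.reindex(shuffled_index)

# Alternative: sample every row (frac=1) in random order.
sampled = cities.sample(frac=1, random_state=0)

print(shuffled.index.tolist(), cities.index.tolist())
```

Assign the result back (for example, cities = cities.reindex(...)) if the shuffled order should persist.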

Exercise #2

The reindex method allows index values that are not in the original DataFrame's index values. Try it and see what happens if you use such values! Why do you think this is allowed?


In [53]:
cities.reindex([4, 2, 1, 3, 0])


Out[53]:
       City Name  Population  Area square miles  Population density  is saint and wide
4            NaN         NaN                NaN                 NaN                NaN
2     Sacramento    485199.0              97.92         4955.055147              False
1       San Jose   1015785.0             176.53         5754.177760               True
3            NaN         NaN                NaN                 NaN                NaN
0  San Francisco    852469.0              46.87        18187.945381              False
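The output shows that reindex adds a row for each unknown label and fills every column with NaN. It also shows a side effect: Population is now printed as a float, because NaN is a floating-point value and the integer column is upcast to hold it. One plausible reason for allowing unknown labels is that index values often come from external data, so it is convenient to align a table to a list of labels without filtering that list first. The sketch below reproduces the behavior on the population Series and shows the fill_value parameter of reindex as one way to avoid the NaN values. The placeholder of 0 is an assumption made for illustration.

```python
import pandas as pd

population = pd.Series([852469, 1015785, 485199])

# Labels 3 and 4 do not exist, so reindex creates them with NaN values.
with_gaps = population.reindex([4, 2, 1, 3, 0])

# NaN is a float, so the integer Series is upcast to float64.
print(with_gaps.dtype)

# fill_value substitutes a chosen placeholder for the missing labels.
filled = population.reindex([4, 2, 1, 3, 0], fill_value=0)
print(filled.tolist())
```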
